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Abstract 

The multiplicative Newton-like method developed by the author et al. is extended to the situa- 
tion where the dynamics is restricted to the orthogonal group. A general framework is constructed 
without specifying the cost function. Though the restriction to the orthogonal groups makes the 
problem somewhat complicated, an explicit expression for the amount of individual jumps is ob- 
tained. This algorithm is exactly second-order-convergent. The global instability inherent in the 
Newton method is remedied by a Levenberg-Marquardt-type variation. The method thus con- 
structed can readily be applied to the independent component analysis. Its remarkable performance 
is illustrated by a numerical simulation. 



1 Overview 

Many optimization problems take the form, "Find an optimal matrix under the constraints (1).. 
(2).. etc." Some of these can be considered as optimizations on Lie groups. For groups, the 
fundamental manipulation is a multiplication whereas an addition is unnatural. In considera- 
tion of this fact, we have constructed a multiplicative Newton- like algorithm for maximizing the 
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kurtosis (a good barometer for the independence) in [ jT.Akuzawa &: N.Murata,199S| ]. There the 
dynamics takes place on the coset GL{1,M.)'^\GL{N,W). We can apply the techniques devel- 
oped in iT.Akuzawa fc N.Murata,1999| ] to many other optimization problems. The coset structure 



GL{1,W)^\GL{N,M.) is, however, characteristic of the independent component analysis (IC A). It is 
understood by the fact that the independence is nothing to do with the scahng. The redundancy 
resulting from the invariance of the model under the componentwise scaling must be eliminated for 
a rigorous discussion and this redundancy corresponds to GL{1,M.)^ . 

Another way to eliminate this redundancy is the prewhitening. The prewhitening is a linear 
transformation of the observed data which maps the covariance matrix to the unit matrix. If we 
deal with prewhitened data, we can legitimately narrow the sweeping range to the orthogonal group. 
The aim of this letter is the construction of a multiplicative algorithm for the orthogonal groups. 

The framework is as follows. A^-dimensional prewhitened random variables {^i|l < i < N} 
are available and it is anticipated that their origins are some unknown mutually independent com- 
ponents {yi*|l < i < N}. The goal of the ICA is the map {Xi} {^*}- We restrict ourselves 
to the linear independent component analysis. There we want to find a linear transformation 
G* : X = {Xi,--- ,X]sf)' Y* = (Y"]*,-- - ,Y^y = C*X which minimizes some cost function 
that measures the independence. Since we are assuming that the data is already prewhitened, the 
covariance matrix of X is the N x N unit matrix. If we do not take into account errors in the 
prewhitening, the optimal point G* must belong to 0{N). 
Giving up the analytical solution, we consider a sequence, 

C(0), Gil), C(2), C(3), , (1.1) 

which converges to the optimal solution G*. The sequence {G{t)} is generated by the left- 
multiplication of another sequence of orthogonal matrices {D{t)}. Each D(t) is specified by the 
coordinate A(t) which satisfies D{t) = e^*-*^ We assume that A(t) is an x skew-symmetric 
matrix, which implies that D{t) belongs to the identity component of 0{N). In practice the pro- 
cedure is as follows. As an initial condition we set C(0). For t > (t S N+), we introduce A(t) 
and denote G{t) as C(t + 1) = e^^*^C(t). Under these settings, we determine A(t) by using the 
Newton method with respect to the matrix elements of A(t). That is, we evaluate the cost function 
at C(t + 1) by expanding it around G{t) in terms of the elements of A(t) up to the second order. 
Then A(t) is choosen as the (unique) critical point of this second order expansion. We iteratively 
follow these procedures until we obtain a satisfactory solution. 

This letter is organized as follows. In Section |2| we will give a complete description of a new 
multiplicative updating method for the orthogonal groups. This section is the main part of this 
letter. Since our formulation does not depend on the details of the cost function the method can 
be useful for many problems other than the ICA. The performance of our method including the 
second-order-convergence is discussed in Section ^. Section ^ is a survey of possible applications 
of our method. The algorithm constructed in Section § is considered as a pure-Newton method 
on the orthogonal groups. To achive the global convergence, we must modify the method. This is 
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accomplished in Section |^. Section ^ also includes a numerical examination of the performance of 
our method. Section ^ is a summary. 

2 Multiplicative updating on 0{N) 

We assume that the cost function F takes the form, 

N 

F{Y) = Y,E{h{Y,)) , (2.1) 

1=1 

where each fi : M ^ M is an unspecified function. Through this letter we denote by E{-) the 
expectation. We will determine the concrete procedures after the Newton manner. First, we 
introduce maps, R and {Ui{l < i < N)}^s, from A^-dimensional dataset to N x N matrices by 

[RiY)h = E (^If^n) (2.2) 

and 

mY)]ki = UUy) = E(^^^^YuY^ . (2.3) 

The goal is the construction of a sequence {y(t)} of the estimates of the independent components, 
which converges to the optimal point Y*. Within the framework of the linear analysis, we consider 
that this sequence is derived from another sequence {C{t)} of the linear transformation by the 
relation Y{t) = C{t)X, where X are the original data. Thus if we restate the problem, the task is 
to determine a sequence {C{t)}. We assume that for each t G N"*" the estimates of the independent 
components at time t and and the estimates at time t + 1 are related by 

Y{t + 1) = D{t)Y{t) (2.4) 

or equivalently 

C{t + l) = D{t)C{t) , (2.5) 

where D{t) is some orthogonal matrix to be fixed. Our method is characterized by this left- 
multiplicative updating rule. As mentioned in the previous section, we assume that each D{t) 
always belongs to the identity component of the orthogonal group 0{N). This assumption is 
reasonable, for example, if the original data X are already prewhitened in the case of the ICA. 
Anyway, under this restriction D{t) is specified by an x anti-symmetric matrix A(t), which 
satisfies 

exp(A(t)) = D{t) . (2.6) 
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For brevity's sake we will omit the argument (t) and denote Y{t + 1) by Z. F(Z) is expanded in 
terms of as 

F{Z) = F{Y) + ti{AR{Y)) + tr (^R{y)) + ^ AikAuUiki(Y) + 0{A^) . (2.7) 

^ ^ i,k,l 

Through the letter we denote by 0(A'^) polynomials of matrix elements of A which does not contain 
terms with degrees less than k. Do not confuse this with the symbol for the orthogonal groups such 
as 0{N). As in the usual Newton method, we truncate the expansion ( p.7| ) at the second order 
with respect to {Ajj}. Then A in this step is determined as the coordinate of the critical point 
of this truncated expansion. The partial derivative of is more convenient for the purpose. It 
reads 



= Rik + l: [AR + iiA],fc + V AkpUkip + 0(A2) , 



(2.8) 



where we have omitted the argument Y for R and [/. Now let us introduce a map cs (the column 
string) as in the previous article p?.Akuzawa fc N.Murata,1999 ]: 



A 



( ^11 ^12 

All ■■■■ 



Mat(iV,F) 



F 



^ CS(A) = (All ^21 ■ ■ ■ A^x Ai2 A22 



(2.9) 



A 



NN 



Ann / 



where Mat(A^, F) is N x N matrices on some unspecified field F. We denote by the upper subscript 
/ the transposition and by f the complex conjugate. For the orthogonal groups it is rather simple 
to move to the framework of the column string as compared to the case of GL{1,'K)^\GL{N,'K): 
By neglecting 0{A'^) terms, the right-hand-side of ( |2.8D is straightforwardly rewritten as 

Rik + ^[AR + RA]i^ + Y,^kpUkip 

p 

= cs{R) + ^{R' (^In + In(S) R) cs(A) + ( Uk)Tcs{A) 



(2.10) 



l+{k-l)N 



where the symbol stands for the direct sum, 

/ Ui 
U2 

N ^ 



k=l 



\ 



•• 

V 



Un-1 
Un / 



(2.11) 
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T is an x N"^ matrix defined by 



cs(^') = rcs(^) for A e Mai{N, F) , (2.12) 

and In is the N x N unit matrix. We denote the tensor product by ^ as usual. The "transposition" 
T is also considered as an intertwiner between two equivalent representations: 

T{A<^B)T = B<SA. (2.13) 

The orthogonal group 0{N) has less degrees of freedom than the general linear group. The canonical 
basis of the Lie algebra, o{N), of 0{N) is N{N — l)/2 anti-symmetric matrices. We will introduce 
some operators which enable us to move to the coordinates based on the canonical basis on o{N). 
In the first place, we introduce an A^^ x N'^ matrix H by 

H = ^H^''^\ (2.14) 

i>j 

where is a 7r/4 rotation between the j + N{i — l)-th component and the i + N{j — l)-th 

component: 



^kl 



1 

t 





for 


k = 3 + N{i - 


1), 


l = j + M{i - 


1) 


for 


k = j + N{i - 


1), 


l = i + M{j - 


1) 


for 


k = i + N{j - 


1), 


1 = j + M{i - 


1) 


for 


k = i + N{j - 


1), 


l = i + M{j - 


1) 




otherwise. 









(2.15) 



The projection operator Pd, 



Pd = diag(pi,--- ,ppf2) , 

Pk = l for k = N{i-l)+i,l<i<N 
Pk = otherwise , 



(2.16) 



is used to extract the diagonal elements of a matrix from its image by cs. Then the coordinate 
transformation is realized by a multiplication of 

H + Pd (2.17) 

to column string vectors. We need to introduce two more projection operators Ps and Pa defined 

by 

Ps = diag(pi,p2, • • • ,PAr2) (2.18) 

Pa = diag(l -pi,l -p2,--- ,1 -Pats) , (2.19) 
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where 



Pk 



1 if j<i and k = i + N{j-l) 







otherwise. 



(2.20) 



By the left- action of Ps and Pa to column string vectors rotated hy H + Po we can extract, 
respectively, the symmetric components and the anti-symmetric components of the matrices. Then 
the conditions for the critical point of the second-order-expansion, which must be satisfied by A, 
are translated into the following two conditions. First, symmetric components of A must vanish. 
This condition is expressed as 



[{H + Pb)cs(A)] .^(,_,)^ = for i < J ^ Ps{H + Pd)cs(A) = Oj . (2.21) 

Secondly, for the anti-symmetric components the condition for the critical point is transformed to 
[{H + PdMR) + {H + Pi5)VFcs(A)] =0 for i > j , (2.22) 



where we have set 



W =\{B! ®In + In®R) + {®Uk)T . 



(2.23) 



The conditions ( 2.21 ) and ( ^.22| ) are combined into an equation. 



Pa{H + Pd)cs{R) + 



Pa{H + Pd)W{H + Pd)'Pa + Ps 



{H + Pd)cs(A) = . 



Note that 

Pa{H + Pd) = PaH . 
The optimal A is immediately obtained from ( |2.24D : 



cs(A) 



-{H + PdY 



Pa{H + Pd)W{H + Pd)'Pa + Ps 



-1 



Pa{H + Pd)cs{R) 



-H' (PaHWH'Pa + Ps) PAHcsiR) . 



(2.24) 



(2.25) 



(2.26) 



Thus we have obtained the explicit updating rule. By iterating the procedure in this section from 
a starting point sufficiently close to the optimal one, the sequences {C{t)} and {Y(t)} converge to 
the optimal solutions. 
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3 Performance (theoretical aspects) 



The second-order-convergence is one of the main advantages of this method. Indeed, this algo- 
rithm is rigorously second-order-convergent. The proof can be given almost in the same way as in 
[ T.Akuzawa &: N.Murata,1999 ]. So we omit the proof in this letter. 

Sometimes we have to deal with large matrices to apply the technique here constructed. Let us 
examine the situation. The N"^ x A^^ matrix PaHW H' Pa -|- Ps is a direct sum of an N{N — l)/2 x 
N{N - l)/2 matrix and an N{N + l)/2y. N{N + l)/2 unit matrix. Within the N{N - l)/2 x 
N[N — 1) /2 block the number of non-zero off-diagonal elements is no more than N{N — 1){N — 2). 
So this is a very sparse matrix when N becomes large. Of course if N becomes extremely large, 
our method requires quite large memories. But due to the sparseness, it remains to be a practical 
tool for problems with considerably large A^. 



•U >J >J 'J M M 



Figure 1: A = 10. The black dots denote non-zero elements of PaHW H' Pa + Ps- 



As is often the case with the Newton method, the global convergence is not assured by this 
algorithm. Fortunately it is possible to cure this fault. We will show the prescription to the global 
instability in Section ^. 



4 Applications to ICA 



So far we have not specified the cost function beyond the assumption that the cost function is a 
sum of the form (2.1). Many of the cost functions for the independent component analysis belong 
to this class. 
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4.1 Kullback-Leibler information 



The Kullback-Leibler information 

N . N 



I ndy,P(y)|lnP(y)-^lnP,(y0| , (4.1) 
1=1 ^ i=i J 



is a good measure for the independence. Here P is the joint probability density function of {Yi\ and 
Pi is the probability density function of the z-th component. We have already restricted ourselves 
to the case where the jacobian of the transformation equals one. Then the minimization of the 
Kullback-Leibler information is equivalent to the minimization of 



N N 



- I \{dYiP{Y)Y,^nPi{Yi)=Y,E{-\nPi{Yi)) . (4.2) 

i i=l i=l 

Thus we can legitimately transform the Kullback-Leibler information to a cost function of the form 
( |2.1| ), where we should set {/j}'s as 

/,(•) = - In P,(-) . (4.3) 

We must evaluate {Pjj's, their derivatives, and so on to determine the optimal solution. A robust 
estimation of these quantities is possibly not an easy task[ B.W. Silverman, 198^ , |D.Cox,1985 ]. 



4.2 Cumulant of fourth order 

The kurtosis of a random variable A is defined by 

The kurtosis is related to the cumulant of the fourth order, 

Cum^^\A) = E{A^) - ^{E{A^)f , (4.5) 

by 

For prewhitened data the kurtosis equals the cumulant of the fourth order. As is well- 
known ^^Hyvarinen^l997, T.Akuzawa & N.Murata,199£], we can grab independent components in 



many cases by seeking the maximum of the absolute values of the kurtoses. Our method is appli- 
cable by setting 



h = (4.7) 
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for all i. If it is known a priori that all the sources {Y*} have positive kurtoses, we may use the 
kurtosis itself and set 

fi = -K. (4.8) 

For these cost functions, R, {Ui}, and other quantities needed for determining each step are cal- 
culated easily from the observed data. Thus applying our method for this cost function is highly 
practical and reasonable choice. 



5 Levenberg-Marquardt-type variation and perfor- 
mance in practice 

The pure-Newton updating rule (|2.2(]|) has a poor global convergence property. This drawback 



is remedied by the Levenberg-Marquardt-type variation |W.H. Press et a/., 1985]. First, We modify 



( 12:261) as 



cs(A) = -H' [PaHWH'Pa + Ps + ^In'^) ^ PaHcs{R) . (5.1) 

The initial value Aq for A is fixed at some positive value. We also fix a real number a(> 1). (In the 
following example we set Aq = 50 and a = 10.) Then the procedure at time t is as follows: 

i) Calculate A by (|3). 

ii) If F{e^Y{t)) is larger than F{Y{t))^ multiply A by a and go back to i). 

iii) Otherwise, multiply A by 1/a and proceed to the next time step t -|- 1. 

Other parts of the algorithm is completely the same as in the pure-Newton version in Section ^. 

Let us examine the real performance of our method under this setting. For the cost function 
we choose the kurtosis as in Subsection The source signals are three synthesizer-generated wav 
files(Fig.|2|). Pseudo-observed data are generated by mixing the source by a random matrix, 

A = h + S, (5.2) 

where each element of S is distributed uniformly on (—1/2,1/2). The residual crosstalk of the 
signals demixed by our method is 1.29% on average. It takes about 122 seconds (CPU time) for 
one hundred iteration of the same problem on our workstation. For reference, we have also solved 



the same demixing problem by the FastICA|IIurri et a/., 1995]. In this case the residual crosstalk is 



1.36% on average and it takes about 156 seconds for one hundred iteration on the same workstation. 
Since the author's knowledge about the FastICA package is limited, one should not take this result 
seriously. It can, however, be said that our method is quite good also in practice. 
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0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 




0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 



Figure 2: Sample data generated by a synthesizer (by courtesy of N.Murata). 



6 Summary 

We have constructed a new algorithm for finding a critical point of broad classes of cost functions 
on the orthogonal groups. This method is second-order-convergent since it is in essence the Newton 
method. The method here constructed is an extension (or a restriction) of the multiplicative 
updating method developed in our previous work| T7Akuzawa &: N.Murata,1999| ]. The constraint 



for A from the nature of the orthogonal groups makes the problem a little complicated. We have, 
however, obtained a rigorous and explicit updating rule. We have also constructed a Levenberg- 
Marquardt-type variation, which is suitable for practical purpose. The global instability inherent in 
the Newton method is remedied in this version. Since our discussion does not depend on the detail 
of the cost function, this method is applicable to many concrete problems. The relatively mild 
assumption ( |2.lD on the form of the cost function, however, implies that our algorithm is especially 
suitable for the ICA. Its practical utility for the ICA have been illustrated here by a numerical 
simulation. 

To summarize, our algorithm has numerous theoretical virtues such as the rigorous second order 
convergence, the explicit and strict formulation, and so on. It provides, also in practice, fast and 
powerful tools for the ICA and many other problems. 
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